Introduction

In my vectorization using .NET APIs blog, I describe SIMD datatypes Vector64<T> and Vector128<T> that operates on ‘Arm64 hardware intrinsic’ APIs present under System.Runtime.Intrinsics.Arm.AdvSimd and System.Runtime.Intrinsics.Arm.AdvSimd.Arm64 class. In this post I will describe those hardware intrinsic APIs by showing sample code usage along with examples and generated Arm64 code. This will help people in understanding these APIs so they can use them to optimize their .NET code written to target Arm64. Since there are 360 APIs, describing all of them in a single post will be overwhelming. So I have divided these APIs among 8 blogs and will demonstrate 45 APIs in each blog. This is part 3 of that blog series. You can checkout my previous blogs at:

Most of the description of these APIs is adapted and referenced from Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile document. You can also refer to the description of SIMD and Floating-point instructions description at Arm developer docs page.

The blog page is programmatically generated and might contain mistakes. If you find any mistake, please leave a comment and I will address it.

APIs covered

ConvertToUInt32RoundToPositiveInfinityScalar ExtractNarrowingSaturateUnsignedLower
ConvertToUInt32RoundToZero ExtractNarrowingSaturateUnsignedScalar
ConvertToUInt32RoundToZeroScalar ExtractNarrowingSaturateUnsignedUpper
ConvertToUInt64RoundAwayFromZero ExtractNarrowingSaturateUpper
ConvertToUInt64RoundAwayFromZeroScalar ExtractNarrowingUpper
ConvertToUInt64RoundToEven ExtractVector128
ConvertToUInt64RoundToEvenScalar ExtractVector64
ConvertToUInt64RoundToNegativeInfinity Floor
ConvertToUInt64RoundToNegativeInfinityScalar FloorScalar
ConvertToUInt64RoundToPositiveInfinity FusedAddHalving
ConvertToUInt64RoundToPositiveInfinityScalar FusedAddRoundedHalving
ConvertToUInt64RoundToZero FusedMultiplyAdd
ConvertToUInt64RoundToZeroScalar FusedMultiplyAddByScalar
Divide FusedMultiplyAddBySelectedScalar
DivideScalar FusedMultiplyAddNegatedScalar
DuplicateSelectedScalarToVector128 FusedMultiplyAddScalar
DuplicateSelectedScalarToVector64 FusedMultiplyAddScalarBySelectedScalar
DuplicateToVector128 FusedMultiplySubtract
DuplicateToVector64 FusedMultiplySubtractByScalar
Extract FusedMultiplySubtractBySelectedScalar
ExtractNarrowingLower FusedMultiplySubtractNegatedScalar
ExtractNarrowingSaturateLower FusedMultiplySubtractScalar
ExtractNarrowingSaturateScalar  

1. ConvertToUInt32RoundToPositiveInfinityScalar

Vector64<uint> ConvertToUInt32RoundToPositiveInfinityScalar(Vector64<float> value)

This method converts each element in the value vector from a floating-point to an unsigned integer value using the Round towards Plus Infinity rounding mode, stores in the result vector and returns the result vector.

private Vector64<uint> ConvertToUInt32RoundToPositiveInfinityScalarTest(Vector64<float> value)
{
  return AdvSimd.ConvertToUInt32RoundToPositiveInfinityScalar(value);
}
// value = <11.5, 12.5>
// Result = <12, 0>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt32RoundToPositiveInfinityScalarTest(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[UInt32]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtpu  s16, s0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

2. ConvertToUInt32RoundToZero

Vector64<uint> ConvertToUInt32RoundToZero(Vector64<float> value)

This method converts each element in the value vector from a floating-point to an unsigned integer value using the Round to Nearest with toward zero rounding mode, stores in the result vector and returns the result vector.

private Vector64<uint> ConvertToUInt32RoundToZeroTest(Vector64<float> value)
{
  return AdvSimd.ConvertToUInt32RoundToZero(value);
}
// value = <11.5, 12.5>
// Result = <11, 12>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<uint> ConvertToUInt32RoundToZero(Vector128<float> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt32RoundToZeroTest(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[UInt32]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtzu  v16.2s, v0.2s
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

3. ConvertToUInt32RoundToZeroScalar

Vector64<uint> ConvertToUInt32RoundToZeroScalar(Vector64<float> value)

This method converts each element in the value vector from a floating-point to an unsigned integer value using the Round to Nearest with toward zero rounding mode, stores in the result vector and returns the result vector.

private Vector64<uint> ConvertToUInt32RoundToZeroScalarTest(Vector64<float> value)
{
  return AdvSimd.ConvertToUInt32RoundToZeroScalar(value);
}
// value = <11.5, 12.5>
// Result = <11, 0>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt32RoundToZeroScalarTest(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[UInt32]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtzu  s16, s0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

4. ConvertToUInt64RoundAwayFromZero

Vector128<ulong> ConvertToUInt64RoundAwayFromZero(Vector128<double> value)

This method converts each element in the value vector from a floating-point to a 64-bits unsigned integer value using the Round to Nearest with Ties to Away rounding mode, stores in the result vector and returns the result vector.

private Vector128<ulong> ConvertToUInt64RoundAwayFromZeroTest(Vector128<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundAwayFromZero(value);
}
// value = <11.5, 12.5>
// Result = <12, 13>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundAwayFromZeroTest(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )  simd16  ->   d0         HFA(simd16) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtau  v16.2d, v0.2d
            mov     v0.16b, v16.16b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

5. ConvertToUInt64RoundAwayFromZeroScalar

Vector64<ulong> ConvertToUInt64RoundAwayFromZeroScalar(Vector64<double> value)

This method converts each element in the value vector from a floating-point to a 64-bits unsigned integer value using the Round to Nearest with Ties to Away rounding mode, stores in the result vector and returns the result vector.

private Vector64<ulong> ConvertToUInt64RoundAwayFromZeroScalarTest(Vector64<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundAwayFromZeroScalar(value);
}
// value = <11.5>
// Result = <12>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundAwayFromZeroScalarTest(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtau  d16, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

6. ConvertToUInt64RoundToEven

Vector128<ulong> ConvertToUInt64RoundToEven(Vector128<double> value)

This method converts each element in the value vector from a floating-point to a 64-bits unsigned integer value using the Round to Nearest rounding mode, stores in the result vector and returns the result vector.

private Vector128<ulong> ConvertToUInt64RoundToEvenTest(Vector128<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundToEven(value);
}
// value = <11.5, 12.5>
// Result = <12, 12>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundToEvenTest(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )  simd16  ->   d0         HFA(simd16) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtnu  v16.2d, v0.2d
            mov     v0.16b, v16.16b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

7. ConvertToUInt64RoundToEvenScalar

Vector64<ulong> ConvertToUInt64RoundToEvenScalar(Vector64<double> value)

This method converts each element in the value vector from a floating-point to a 64-bits unsigned integer value using the Round to Nearest rounding mode, stores in the result vector and returns the result vector.

private Vector64<ulong> ConvertToUInt64RoundToEvenScalarTest(Vector64<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundToEvenScalar(value);
}
// value = <11.5>
// Result = <12>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundToEvenScalarTest(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtnu  d16, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

8. ConvertToUInt64RoundToNegativeInfinity

Vector128<ulong> ConvertToUInt64RoundToNegativeInfinity(Vector128<double> value)

This method converts each element in the value vector from a floating-point value to a 64-bits unsigned integer value using the Round towards Minus Infinity rounding mode, and returns the result.

private Vector128<ulong> ConvertToUInt64RoundToNegativeInfinityTest(Vector128<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundToNegativeInfinity(value);
}
// value = <11.5, 12.5>
// Result = <11, 12>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundToNegativeInfinityTest(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )  simd16  ->   d0         HFA(simd16) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtmu  v16.2d, v0.2d
            mov     v0.16b, v16.16b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

9. ConvertToUInt64RoundToNegativeInfinityScalar

Vector64<ulong> ConvertToUInt64RoundToNegativeInfinityScalar(Vector64<double> value)

This method converts each element in the value vector from a floating-point value to a 64-bits unsigned integer value using the Round towards Minus Infinity rounding mode, and returns the result.

private Vector64<ulong> ConvertToUInt64RoundToNegativeInfinityScalarTest(Vector64<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundToNegativeInfinityScalar(value);
}
// value = <11.5>
// Result = <11>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundToNegativeInfinityScalarTest(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtmu  d16, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

10. ConvertToUInt64RoundToPositiveInfinity

Vector128<ulong> ConvertToUInt64RoundToPositiveInfinity(Vector128<double> value)

This method converts each element in the value vector from a floating-point value to a 64-bits unsigned integer value using the Round towards Plus Infinity rounding mode, and returns the result.

private Vector128<ulong> ConvertToUInt64RoundToPositiveInfinityTest(Vector128<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundToPositiveInfinity(value);
}
// value = <11.5, 12.5>
// Result = <12, 13>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundToPositiveInfinityTest(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )  simd16  ->   d0         HFA(simd16) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtpu  v16.2d, v0.2d
            mov     v0.16b, v16.16b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

11. ConvertToUInt64RoundToPositiveInfinityScalar

Vector64<ulong> ConvertToUInt64RoundToPositiveInfinityScalar(Vector64<double> value)

This method converts each element in the value vector from a floating-point value to a 64-bits unsigned integer value using the Round towards Plus Infinity rounding mode, and returns the result.

private Vector64<ulong> ConvertToUInt64RoundToPositiveInfinityScalarTest(Vector64<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundToPositiveInfinityScalar(value);
}
// value = <11.5>
// Result = <12>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundToPositiveInfinityScalarTest(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtpu  d16, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

12. ConvertToUInt64RoundToZero

Vector128<ulong> ConvertToUInt64RoundToZero(Vector128<double> value)

This method converts each element in the value vector from a floating-point value to a 64-bits unsigned integer value using the Round towards Zero rounding mode, and returns the result.

private Vector128<ulong> ConvertToUInt64RoundToZeroTest(Vector128<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundToZero(value);
}
// value = <11.5, 12.5>
// Result = <11, 12>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundToZeroTest(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )  simd16  ->   d0         HFA(simd16) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtzu  v16.2d, v0.2d
            mov     v0.16b, v16.16b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

13. ConvertToUInt64RoundToZeroScalar

Vector64<ulong> ConvertToUInt64RoundToZeroScalar(Vector64<double> value)

This method converts each element in the value vector from a floating-point value to a 64-bits unsigned integer value using the Round towards Zero rounding mode, and returns the result.

private Vector64<ulong> ConvertToUInt64RoundToZeroScalarTest(Vector64<double> value)
{
  return AdvSimd.Arm64.ConvertToUInt64RoundToZeroScalar(value);
}
// value = <11.5>
// Result = <11>

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ConvertToUInt64RoundToZeroScalarTest(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[UInt64]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fcvtzu  d16, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

14. Divide

Vector64<float> Divide(Vector64<float> left, Vector64<float> right)

This method divides the corresponding floating-point values in the left vector, by those in the right vector, stores the result in a result vector, and returns the result vector.

private Vector64<float> DivideTest(Vector64<float> left, Vector64<float> right)
{
  return AdvSimd.Arm64.Divide(left, right);
}
// left = <11.5, 12.5>
// right = <21.5, 22.5>
// Result = <0.53488374, 0.5555556>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector128<double> Divide(Vector128<double> left, Vector128<double> right)
Vector128<float> Divide(Vector128<float> left, Vector128<float> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:DivideTest(System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fdiv    v16.2s, v0.2s, v1.2s
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

15. DivideScalar

Vector64<double> DivideScalar(Vector64<double> left, Vector64<double> right)

This method divides the corresponding floating-point values in the left vector, by those in the right vector, stores the result in a result vector, and returns the result vector.

private Vector64<double> DivideScalarTest(Vector64<double> left, Vector64<double> right)
{
  return AdvSimd.DivideScalar(left, right);
}
// left = <11>
// right = <3.1>
// Result = <3.5483873>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<float> DivideScalar(Vector64<float> left, Vector64<float> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:DivideScalarTest(System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fdiv    d16, d0, d1
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

16. DuplicateSelectedScalarToVector128

Vector128<byte> DuplicateSelectedScalarToVector128(Vector64<byte> value, byte index)

This method creates a vector by duplicating the vector element at index in value vector into each element of the result vector. As seen in below example, the result vector elements count Vector128` is double that of input parameter count `Vector64`.

private Vector128<byte> DuplicateSelectedScalarToVector128Test(Vector64<byte> value, byte index)
{
  return AdvSimd.DuplicateSelectedScalarToVector128(value, 3);
}
// value = <11, 12, 13, 14, 15, 16, 17, 18>
// index = 3
// Result = <14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<short> DuplicateSelectedScalarToVector128(Vector64<short> value, byte index)
Vector128<int> DuplicateSelectedScalarToVector128(Vector64<int> value, byte index)
Vector128<float> DuplicateSelectedScalarToVector128(Vector64<float> value, byte index)
Vector128<sbyte> DuplicateSelectedScalarToVector128(Vector64<sbyte> value, byte index)
Vector128<ushort> DuplicateSelectedScalarToVector128(Vector64<ushort> value, byte index)
Vector128<uint> DuplicateSelectedScalarToVector128(Vector64<uint> value, byte index)
Vector128<byte> DuplicateSelectedScalarToVector128(Vector128<byte> value, byte index)
Vector128<short> DuplicateSelectedScalarToVector128(Vector128<short> value, byte index)
Vector128<int> DuplicateSelectedScalarToVector128(Vector128<int> value, byte index)
Vector128<float> DuplicateSelectedScalarToVector128(Vector128<float> value, byte index)
Vector128<sbyte> DuplicateSelectedScalarToVector128(Vector128<sbyte> value, byte index)
Vector128<ushort> DuplicateSelectedScalarToVector128(Vector128<ushort> value, byte index)
Vector128<uint> DuplicateSelectedScalarToVector128(Vector128<uint> value, byte index)

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector128<double> DuplicateSelectedScalarToVector128(Vector128<double> value, byte index)
Vector128<long> DuplicateSelectedScalarToVector128(Vector128<long> value, byte index)
Vector128<ulong> DuplicateSelectedScalarToVector128(Vector128<ulong> value, byte index)

See Microsoft docs here and here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:DuplicateSelectedScalarToVector128Test(System.Runtime.Intrinsics.Vector64`1[Byte],ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;* V01 arg1         [V01    ] (  0,  0   )   ubyte  ->  zero-ref   
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            dup     v16.16b, v0.b[3]
            mov     v0.16b, v16.16b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

17. DuplicateSelectedScalarToVector64

Vector64<byte> DuplicateSelectedScalarToVector64(Vector64<byte> value, byte index)

This method creates a vector by duplicating the vector element at index in value vector into each element of the result vector.

private Vector64<byte> DuplicateSelectedScalarToVector64Test(Vector64<byte> value, byte index)
{
  return AdvSimd.DuplicateSelectedScalarToVector64(value, 3);
}
// value = <11, 12, 13, 14, 15, 16, 17, 18>
// index = 3
// Result = <14, 14, 14, 14, 14, 14, 14, 14>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<short> DuplicateSelectedScalarToVector64(Vector64<short> value, byte index)
Vector64<int> DuplicateSelectedScalarToVector64(Vector64<int> value, byte index)
Vector64<float> DuplicateSelectedScalarToVector64(Vector64<float> value, byte index)
Vector64<sbyte> DuplicateSelectedScalarToVector64(Vector64<sbyte> value, byte index)
Vector64<ushort> DuplicateSelectedScalarToVector64(Vector64<ushort> value, byte index)
Vector64<uint> DuplicateSelectedScalarToVector64(Vector64<uint> value, byte index)
Vector64<byte> DuplicateSelectedScalarToVector64(Vector128<byte> value, byte index)
Vector64<short> DuplicateSelectedScalarToVector64(Vector128<short> value, byte index)
Vector64<int> DuplicateSelectedScalarToVector64(Vector128<int> value, byte index)
Vector64<float> DuplicateSelectedScalarToVector64(Vector128<float> value, byte index)
Vector64<sbyte> DuplicateSelectedScalarToVector64(Vector128<sbyte> value, byte index)
Vector64<ushort> DuplicateSelectedScalarToVector64(Vector128<ushort> value, byte index)
Vector64<uint> DuplicateSelectedScalarToVector64(Vector128<uint> value, byte index)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:DuplicateSelectedScalarToVector64Test(System.Runtime.Intrinsics.Vector64`1[Byte],ubyte):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;* V01 arg1         [V01    ] (  0,  0   )   ubyte  ->  zero-ref   
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            dup     v16.8b, v0.b[3]
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

18. DuplicateToVector128

Vector128<byte> DuplicateToVector128(byte value)

This method creates a vector by duplicating the value into each element in the result vector.

private Vector128<byte> DuplicateToVector128Test(byte value)
{
  return AdvSimd.DuplicateToVector128(value);
}
// value = 7
// Result = <7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<short> DuplicateToVector128(short value)
Vector128<int> DuplicateToVector128(int value)
Vector128<sbyte> DuplicateToVector128(sbyte value)
Vector128<float> DuplicateToVector128(float value)
Vector128<ushort> DuplicateToVector128(ushort value)
Vector128<uint> DuplicateToVector128(uint value)

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector128<double> DuplicateToVector128(double value)
Vector128<long> DuplicateToVector128(long value)
Vector128<ulong> DuplicateToVector128(ulong value)

See Microsoft docs here and here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:DuplicateToVector128Test(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   ubyte  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            uxtb    w0, w0
            dup     v16.16b, w0
            mov     v0.16b, v16.16b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 28, prolog size 8

19. DuplicateToVector64

Vector64<byte> DuplicateToVector64(byte value)

This method creates a vector by duplicating the value into each element in the result vector.

private Vector64<byte> DuplicateToVector64Test(byte value)
{
  return AdvSimd.DuplicateToVector64(value);
}
// value = 5
// Result = <5, 5, 5, 5, 5, 5, 5, 5>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<short> DuplicateToVector64(short value)
Vector64<int> DuplicateToVector64(int value)
Vector64<sbyte> DuplicateToVector64(sbyte value)
Vector64<float> DuplicateToVector64(float value)
Vector64<ushort> DuplicateToVector64(ushort value)
Vector64<uint> DuplicateToVector64(uint value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:DuplicateToVector64Test(ubyte):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   ubyte  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            uxtb    w0, w0
            dup     v16.8b, w0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 28, prolog size 8

20. Extract

byte Extract(Vector64<byte> vector, byte index)

This method extracts an element from vector at index and returns it.

private byte ExtractTest(Vector64<byte> vector, byte index)
{
  return AdvSimd.Extract(vector, 3);
}
// vector = <11, 12, 13, 14, 15, 16, 17, 18>
// index = 3
// Result = 14

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
short Extract(Vector64<short> vector, byte index)
int Extract(Vector64<int> vector, byte index)
sbyte Extract(Vector64<sbyte> vector, byte index)
float Extract(Vector64<float> vector, byte index)
ushort Extract(Vector64<ushort> vector, byte index)
uint Extract(Vector64<uint> vector, byte index)
byte Extract(Vector128<byte> vector, byte index)
double Extract(Vector128<double> vector, byte index)
short Extract(Vector128<short> vector, byte index)
int Extract(Vector128<int> vector, byte index)
long Extract(Vector128<long> vector, byte index)
sbyte Extract(Vector128<sbyte> vector, byte index)
float Extract(Vector128<float> vector, byte index)
ushort Extract(Vector128<ushort> vector, byte index)
uint Extract(Vector128<uint> vector, byte index)
ulong Extract(Vector128<ulong> vector, byte index)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractTest(System.Runtime.Intrinsics.Vector64`1[Byte],ubyte):ubyte
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;* V01 arg1         [V01    ] (  0,  0   )   ubyte  ->  zero-ref   
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            umov    w0, v0.b[3]
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

21. ExtractNarrowingLower

Vector64<byte> ExtractNarrowingLower(Vector128<ushort> value)

This method narrows each element in the value vector to half the original width, stores the result into a result vector and returns the vector. As seen in below example, the result vector element’s size byte is half as long as that of input parameter element’s size ushort.

private Vector64<byte> ExtractNarrowingLowerTest(Vector128<ushort> value)
{
  return AdvSimd.ExtractNarrowingLower(value);
}
// value = <300, 12, 413, 514, 15, 216, 117, 618>
// Result = <44, 12, 157, 2, 15, 216, 117, 106>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<short> ExtractNarrowingLower(Vector128<int> value)
Vector64<int> ExtractNarrowingLower(Vector128<long> value)
Vector64<sbyte> ExtractNarrowingLower(Vector128<short> value)
Vector64<ushort> ExtractNarrowingLower(Vector128<uint> value)
Vector64<uint> ExtractNarrowingLower(Vector128<ulong> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractNarrowingLowerTest(System.Runtime.Intrinsics.Vector128`1[UInt16]):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )  simd16  ->   d0         HFA(simd16) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            xtn     v16.8b, v0.8h
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

22. ExtractNarrowingSaturateLower

Vector64<byte> ExtractNarrowingSaturateLower(Vector128<ushort> value)

This method saturates each element in the value vector to half the original width, stores the result into a result vector, and returns the result vector.

private Vector64<byte> ExtractNarrowingSaturateLowerTest(Vector128<ushort> value)
{
  return AdvSimd.ExtractNarrowingSaturateLower(value);
}
// value = <300, 12, 413, 514, 15, 216, 117, 618>
// Result = <255, 12, 255, 255, 15, 216, 117, 255>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<short> ExtractNarrowingSaturateLower(Vector128<int> value)
Vector64<int> ExtractNarrowingSaturateLower(Vector128<long> value)
Vector64<sbyte> ExtractNarrowingSaturateLower(Vector128<short> value)
Vector64<ushort> ExtractNarrowingSaturateLower(Vector128<uint> value)
Vector64<uint> ExtractNarrowingSaturateLower(Vector128<ulong> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractNarrowingSaturateLowerTest(System.Runtime.Intrinsics.Vector128`1[UInt16]):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )  simd16  ->   d0         HFA(simd16) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            uqxtn   v16.8b, v0.8h
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

23. ExtractNarrowingSaturateScalar

Vector64<byte> ExtractNarrowingSaturateScalar(Vector64<ushort> value)

This method saturates 0th element in the value vector to half the original width, stores the result into a result vector, and returns the result vector. Other elements except 0th element are initialized to 0.

private Vector64<byte> ExtractNarrowingSaturateScalarTest(Vector64<ushort> value)
{
  return AdvSimd.Arm64.ExtractNarrowingSaturateScalar(value);
}
// value = <500, 500, 500, 500>
// Result = <255, 0, 0, 0, 0, 0, 0, 0>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector64<short> ExtractNarrowingSaturateScalar(Vector64<int> value)
Vector64<int> ExtractNarrowingSaturateScalar(Vector64<long> value)
Vector64<sbyte> ExtractNarrowingSaturateScalar(Vector64<short> value)
Vector64<ushort> ExtractNarrowingSaturateScalar(Vector64<uint> value)
Vector64<uint> ExtractNarrowingSaturateScalar(Vector64<ulong> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractNarrowingSaturateScalarTest(System.Runtime.Intrinsics.Vector64`1[UInt16]):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            uqxtn   b16, h0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

24. ExtractNarrowingSaturateUnsignedLower

Vector64<byte> ExtractNarrowingSaturateUnsignedLower(Vector128<short> value)

This method saturates each element (which is always signed integer value) in the value vector to an unsigned integer value that is half the original width, stores the result in a result vector, and returns the result vector. As seen in below example, the result vector element’s size byte is half as long as the input parameter value’s element’s size short.

private Vector64<byte> ExtractNarrowingSaturateUnsignedLowerTest(Vector128<short> value)
{
  return AdvSimd.ExtractNarrowingSaturateUnsignedLower(value);
}
// value = <-300, -12, 413, 514, 15, 216, 117, 618>
// Result = <0, 0, 255, 255, 15, 216, 117, 255>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<ushort> ExtractNarrowingSaturateUnsignedLower(Vector128<int> value)
Vector64<uint> ExtractNarrowingSaturateUnsignedLower(Vector128<long> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractNarrowingSaturateUnsignedLowerTest(System.Runtime.Intrinsics.Vector128`1[Int16]):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )  simd16  ->   d0         HFA(simd16) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            sqxtun  v16.8b, v0.8h
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

25. ExtractNarrowingSaturateUnsignedScalar

Vector64<byte> ExtractNarrowingSaturateUnsignedScalar(Vector64<short> value)

This method saturates 0th element (which is always signed integer value) in the value vector to an unsigned integer value that is half the original width, stores the result in a result vector, and returns the result vector. As seen in below example, the result vector element’s size byte is half as long as the input parameter value’s element’s size short. All the other elements of result vector except 0th element is initialized to 0.

private Vector64<byte> ExtractNarrowingSaturateUnsignedScalarTest(Vector64<short> value)
{
  return AdvSimd.Arm64.ExtractNarrowingSaturateUnsignedScalar(value);
}
// value = <11, 12, 13, 14>
// Result = <11, 0, 0, 0, 0, 0, 0, 0>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector64<ushort> ExtractNarrowingSaturateUnsignedScalar(Vector64<int> value)
Vector64<uint> ExtractNarrowingSaturateUnsignedScalar(Vector64<long> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractNarrowingSaturateUnsignedScalarTest(System.Runtime.Intrinsics.Vector64`1[Int16]):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            sqxtun  b16, h0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

26. ExtractNarrowingSaturateUnsignedUpper

Vector128<byte> ExtractNarrowingSaturateUnsignedUpper(Vector64<byte> lower, Vector128<short> value)

This method saturates each element (which is always signed integer value) in the upper half of value vector to an unsigned integer value that is half the original width, stores the result in the upper-half of result vector, and returns the result vector, the lower-half of the result vector contains values from lower vector. As seen in below example, the result vector element’s size byte is half as long as the input parameter value’s element’s size short.

private Vector128<byte> ExtractNarrowingSaturateUnsignedUpperTest(Vector64<byte> lower, Vector128<short> value)
{
  return AdvSimd.ExtractNarrowingSaturateUnsignedUpper(lower, value);
}
// lower = <125, 12, 13, 14, 15, 216, 117, 18>
// value = <-500, 500, 12, 14, 257, 16, 17, 18>
// Result = <125, 12, 13, 14, 15, 216, 117, 18, 0, 255, 12, 14, 255, 16, 17, 18>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<ushort> ExtractNarrowingSaturateUnsignedUpper(Vector64<ushort> lower, Vector128<int> value)
Vector128<uint> ExtractNarrowingSaturateUnsignedUpper(Vector64<uint> lower, Vector128<long> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractNarrowingSaturateUnsignedUpperTest(System.Runtime.Intrinsics.Vector64`1[Byte],System.Runtime.Intrinsics.Vector128`1[Int16]):System.Runtime.Intrinsics.Vector128`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )  simd16  ->   d1         HFA(simd16) 
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            sqxtun2 v0.16b, v1.8h
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

27. ExtractNarrowingSaturateUpper

Vector128<byte> ExtractNarrowingSaturateUpper(Vector64<byte> lower, Vector128<ushort> value)

This method saturates each element in the upper-half of value vector to half the original width, stores the result into the upper-half of result vector, and returns the result vector, the lower half of result vector containing the values from lower vector.

private Vector128<byte> ExtractNarrowingSaturateUpperTest(Vector64<byte> lower, Vector128<ushort> value)
{
  return AdvSimd.ExtractNarrowingSaturateUpper(lower, value);
}
// lower = <125, 12, 13, 14, 15, 216, 117, 18>
// value = <500, 500, 12, 14, 257, 16, 17, 18>
// Result = <125, 12, 13, 14, 15, 216, 117, 18, 255, 255, 12, 14, 255, 16, 17, 18>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<short> ExtractNarrowingSaturateUpper(Vector64<short> lower, Vector128<int> value)
Vector128<int> ExtractNarrowingSaturateUpper(Vector64<int> lower, Vector128<long> value)
Vector128<sbyte> ExtractNarrowingSaturateUpper(Vector64<sbyte> lower, Vector128<short> value)
Vector128<ushort> ExtractNarrowingSaturateUpper(Vector64<ushort> lower, Vector128<uint> value)
Vector128<uint> ExtractNarrowingSaturateUpper(Vector64<uint> lower, Vector128<ulong> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractNarrowingSaturateUpperTest(System.Runtime.Intrinsics.Vector64`1[Byte],System.Runtime.Intrinsics.Vector128`1[UInt16]):System.Runtime.Intrinsics.Vector128`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )  simd16  ->   d1         HFA(simd16) 
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            uqxtn2  v0.16b, v1.8h
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

28. ExtractNarrowingUpper

Vector128<byte> ExtractNarrowingUpper(Vector64<byte> lower, Vector128<ushort> value)

This method narrows each element in the upper half of value vector to half the original width, stores the result in the upper half of result vector and returns the vector. The lower half of result vector contains values from lower vector. As seen in below example, the result vector element’s size byte is half as long as that of input parameter element’s size ushort.

private Vector128<byte> ExtractNarrowingUpperTest(Vector64<byte> lower, Vector128<ushort> value)
{
  return AdvSimd.ExtractNarrowingUpper(lower, value);
}
// lower = <125, 12, 13, 14, 15, 216, 117, 18>
// value = <500, 500, 12, 14, 257, 16, 17, 18>
// Result = <125, 12, 13, 14, 15, 216, 117, 18, 244, 244, 12, 14, 1, 16, 17, 18>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<short> ExtractNarrowingUpper(Vector64<short> lower, Vector128<int> value)
Vector128<int> ExtractNarrowingUpper(Vector64<int> lower, Vector128<long> value)
Vector128<sbyte> ExtractNarrowingUpper(Vector64<sbyte> lower, Vector128<short> value)
Vector128<ushort> ExtractNarrowingUpper(Vector64<ushort> lower, Vector128<uint> value)
Vector128<uint> ExtractNarrowingUpper(Vector64<uint> lower, Vector128<ulong> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractNarrowingUpperTest(System.Runtime.Intrinsics.Vector64`1[Byte],System.Runtime.Intrinsics.Vector128`1[UInt16]):System.Runtime.Intrinsics.Vector128`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )  simd16  ->   d1         HFA(simd16) 
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            xtn2    v0.16b, v1.8h
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

29. ExtractVector128

Vector128<byte> ExtractVector128(Vector128<byte> upper, Vector128<byte> lower, byte index)

This method extracts the vector elements from upper starting at index (and hence should be less than the size of vector) and fills the result vector. Once the upper vector runs out and there is room to fill in, elements from lower elements are picked.

private Vector128<byte> ExtractVector128Test(Vector128<byte> upper, Vector128<byte> lower, byte index)
{
  return AdvSimd.ExtractVector128(upper, lower, 5);
}
// upper = <11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26>
// lower = <31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 42, 44, 45, 46>
// index = 5
// Result = <16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 31, 32, 33, 34, 35>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<double> ExtractVector128(Vector128<double> upper, Vector128<double> lower, byte index)
Vector128<short> ExtractVector128(Vector128<short> upper, Vector128<short> lower, byte index)
Vector128<int> ExtractVector128(Vector128<int> upper, Vector128<int> lower, byte index)
Vector128<long> ExtractVector128(Vector128<long> upper, Vector128<long> lower, byte index)
Vector128<sbyte> ExtractVector128(Vector128<sbyte> upper, Vector128<sbyte> lower, byte index)
Vector128<float> ExtractVector128(Vector128<float> upper, Vector128<float> lower, byte index)
Vector128<ushort> ExtractVector128(Vector128<ushort> upper, Vector128<ushort> lower, byte index)
Vector128<uint> ExtractVector128(Vector128<uint> upper, Vector128<uint> lower, byte index)
Vector128<ulong> ExtractVector128(Vector128<ulong> upper, Vector128<ulong> lower, byte index)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractVector128Test(System.Runtime.Intrinsics.Vector128`1[Byte],System.Runtime.Intrinsics.Vector128`1[Byte],ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )  simd16  ->   d0         HFA(simd16) 
;  V01 arg1         [V01,T01] (  3,  3   )  simd16  ->   d1         HFA(simd16) 
;* V02 arg2         [V02    ] (  0,  0   )   ubyte  ->  zero-ref   
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            ext     v16.16b, v0.16b, v1.16b, #5
            mov     v0.16b, v16.16b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

30. ExtractVector64

Vector64<byte> ExtractVector64(Vector64<byte> upper, Vector64<byte> lower, byte index)

This method extracts the vector elements from upper starting at index (and hence should be less than the size of vector) and fills the result vector. Once the upper vector runs out and there is room to fill in, elements from lower elements are picked. This method is same as ExtractVector128() except it operates on Vector64<T>.

private Vector64<byte> ExtractVector64Test(Vector64<byte> upper, Vector64<byte> lower, byte index)
{
  return AdvSimd.ExtractVector64(upper, lower, 5);
}
// upper = <11, 12, 13, 14, 15, 16, 17, 18>
// lower = <21, 22, 23, 24, 25, 26, 27, 28>
// index = 5
// Result = <16, 17, 18, 21, 22, 23, 24, 25>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<short> ExtractVector64(Vector64<short> upper, Vector64<short> lower, byte index)
Vector64<int> ExtractVector64(Vector64<int> upper, Vector64<int> lower, byte index)
Vector64<sbyte> ExtractVector64(Vector64<sbyte> upper, Vector64<sbyte> lower, byte index)
Vector64<float> ExtractVector64(Vector64<float> upper, Vector64<float> lower, byte index)
Vector64<ushort> ExtractVector64(Vector64<ushort> upper, Vector64<ushort> lower, byte index)
Vector64<uint> ExtractVector64(Vector64<uint> upper, Vector64<uint> lower, byte index)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:ExtractVector64Test(System.Runtime.Intrinsics.Vector64`1[Byte],System.Runtime.Intrinsics.Vector64`1[Byte],ubyte):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;* V02 arg2         [V02    ] (  0,  0   )   ubyte  ->  zero-ref   
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            ext     v16.8b, v0.8b, v1.8b, #5
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

31. Floor

Vector64<float> Floor(Vector64<float> value)

This method rounds each element in the value vector containing floating-point values to integral floating-point values of the same size using the Round towards Minus Infinity rounding mode, places the result in a vector and return the result vector. As per ARM docs, a zero input gives a zero result with the same sign, an infinite input gives an infinite result with the same sign, and a NaN is propagated as for normal arithmetic.

private Vector64<float> FloorTest(Vector64<float> value)
{
  return AdvSimd.Floor(value);
}
// value = <11.5, 12.5>
// Result = <11, 12>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<float> Floor(Vector128<float> value)

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector128<double> Floor(Vector128<double> value)

See Microsoft docs here and here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FloorTest(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            frintm  v16.2s, v0.2s
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

32. FloorScalar

Vector64<double> FloorScalar(Vector64<double> value)

This method rounds each element in the value vector containing floating-point values to integral floating-point values of the same size using the Round towards Minus Infinity rounding mode, places the result in a vector and return the result vector. As per ARM docs, a zero input gives a zero result with the same sign, an infinite input gives an infinite result with the same sign, and a NaN is propagated as for normal arithmetic.

private Vector64<double> FloorScalarTest(Vector64<double> value)
{
  return AdvSimd.FloorScalar(value);
}
// value = <11.5>
// Result = <11>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<float> FloorScalar(Vector64<float> value)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FloorScalarTest(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            frintm  d16, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

33. FusedAddHalving

Vector64<byte> FusedAddHalving(Vector64<byte> left, Vector64<byte> right)

This method adds corresponding element values from the left and right vectors, shifts each result right one bit, places the truncated results in a vector, and returns the result vector.

private Vector64<byte> FusedAddHalvingTest(Vector64<byte> left, Vector64<byte> right)
{
  return AdvSimd.FusedAddHalving(left, right);
}
// left = <11, 12, 13, 14, 15, 16, 17, 18>
// right = <21, 22, 23, 24, 25, 26, 27, 28>
// Result = <16, 17, 18, 19, 20, 21, 22, 23>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<short> FusedAddHalving(Vector64<short> left, Vector64<short> right)
Vector64<int> FusedAddHalving(Vector64<int> left, Vector64<int> right)
Vector64<sbyte> FusedAddHalving(Vector64<sbyte> left, Vector64<sbyte> right)
Vector64<ushort> FusedAddHalving(Vector64<ushort> left, Vector64<ushort> right)
Vector64<uint> FusedAddHalving(Vector64<uint> left, Vector64<uint> right)
Vector128<byte> FusedAddHalving(Vector128<byte> left, Vector128<byte> right)
Vector128<short> FusedAddHalving(Vector128<short> left, Vector128<short> right)
Vector128<int> FusedAddHalving(Vector128<int> left, Vector128<int> right)
Vector128<sbyte> FusedAddHalving(Vector128<sbyte> left, Vector128<sbyte> right)
Vector128<ushort> FusedAddHalving(Vector128<ushort> left, Vector128<ushort> right)
Vector128<uint> FusedAddHalving(Vector128<uint> left, Vector128<uint> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedAddHalvingTest(System.Runtime.Intrinsics.Vector64`1[Byte],System.Runtime.Intrinsics.Vector64`1[Byte]):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            uhadd   v16.8b, v0.8b, v1.8b
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

34. FusedAddRoundedHalving

Vector64<byte> FusedAddRoundedHalving(Vector64<byte> left, Vector64<byte> right)

This method adds corresponding element values from the left and right vectors, shifts each result right one bit, places the rounded results in a vector, and returns the result vector.

private Vector64<byte> FusedAddRoundedHalvingTest(Vector64<byte> left, Vector64<byte> right)
{
  return AdvSimd.FusedAddRoundedHalving(left, right);
}
// left = <11, 12, 13, 14, 15, 16, 17, 18>
// right = <21, 22, 23, 24, 25, 26, 27, 28>
// Result = <16, 17, 18, 19, 20, 21, 22, 23>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<short> FusedAddRoundedHalving(Vector64<short> left, Vector64<short> right)
Vector64<int> FusedAddRoundedHalving(Vector64<int> left, Vector64<int> right)
Vector64<sbyte> FusedAddRoundedHalving(Vector64<sbyte> left, Vector64<sbyte> right)
Vector64<ushort> FusedAddRoundedHalving(Vector64<ushort> left, Vector64<ushort> right)
Vector64<uint> FusedAddRoundedHalving(Vector64<uint> left, Vector64<uint> right)
Vector128<byte> FusedAddRoundedHalving(Vector128<byte> left, Vector128<byte> right)
Vector128<short> FusedAddRoundedHalving(Vector128<short> left, Vector128<short> right)
Vector128<int> FusedAddRoundedHalving(Vector128<int> left, Vector128<int> right)
Vector128<sbyte> FusedAddRoundedHalving(Vector128<sbyte> left, Vector128<sbyte> right)
Vector128<ushort> FusedAddRoundedHalving(Vector128<ushort> left, Vector128<ushort> right)
Vector128<uint> FusedAddRoundedHalving(Vector128<uint> left, Vector128<uint> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedAddRoundedHalvingTest(System.Runtime.Intrinsics.Vector64`1[Byte],System.Runtime.Intrinsics.Vector64`1[Byte]):System.Runtime.Intrinsics.Vector64`1[Byte]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            urhadd  v16.8b, v0.8b, v1.8b
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

35. FusedMultiplyAdd

Vector64<float> FusedMultiplyAdd(Vector64<float> addend, Vector64<float> left, Vector64<float> right)

This method multiplies corresponding floating-point values in the vectors in the left and right vectors, adds the product to the vector elements of the addened vector, and returns the accumulated result vector.

private Vector64<float> FusedMultiplyAddTest(Vector64<float> addend, Vector64<float> left, Vector64<float> right)
{
  return AdvSimd.FusedMultiplyAdd(addend, left, right);
}
// addend = <11.5, 12.5>
// left = <21.5, 22.5>
// right = <11.5, 12.5>
// Result = <258.75, 293.75>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<float> FusedMultiplyAdd(Vector128<float> addend, Vector128<float> left, Vector128<float> right)

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector128<double> FusedMultiplyAdd(Vector128<double> addend, Vector128<double> left, Vector128<double> right)

See Microsoft docs here and here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplyAddTest(System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fmla    v0.2s, v1.2s, v2.2s
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

36. FusedMultiplyAddByScalar

Vector64<float> FusedMultiplyAddByScalar(Vector64<float> addend, Vector64<float> left, Vector64<float> right)

This method multiplies floating-point value element at 0th index of right vector with elements in the left vector, adds the product to the vector elements of the addened vector, and returns the accumulated result vector.

private Vector64<float> FusedMultiplyAddByScalarTest(Vector64<float> addend, Vector64<float> left, Vector64<float> right)
{
  return AdvSimd.Arm64.FusedMultiplyAddByScalar(addend, left, right);
}
// addend = <11.5, 12.5>
// left = <21.5, 22.5>
// right = <11.5, 12.5>
// Result = <258.75, 271.25>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector128<double> FusedMultiplyAddByScalar(Vector128<double> addend, Vector128<double> left, Vector64<double> right)
Vector128<float> FusedMultiplyAddByScalar(Vector128<float> addend, Vector128<float> left, Vector64<float> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplyAddByScalarTest(System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fmla    v0.2s, v1.2s, v2.s[0]
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

37. FusedMultiplyAddBySelectedScalar

Vector64<float> FusedMultiplyAddBySelectedScalar(Vector64<float> addend, Vector64<float> left, Vector64<float> right, byte rightIndex)

This method multiplies floating-point value element at rightIndex index of right vector with elements in the left vector, adds the product to the vector elements of the addened vector, and returns the accumulated result vector.

private Vector64<float> FusedMultiplyAddBySelectedScalarTest(Vector64<float> addend, Vector64<float> left, Vector64<float> right, byte rightIndex)
{
  return AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar(addend, left, right, 0);
}
// addend = <11.5, 12.5>
// left = <21.5, 22.5>
// right = <11.5, 12.5>
// rightIndex = 0
// Result = <258.75, 271.25>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector64<float> FusedMultiplyAddBySelectedScalar(Vector64<float> addend, Vector64<float> left, Vector128<float> right, byte rightIndex)
Vector128<double> FusedMultiplyAddBySelectedScalar(Vector128<double> addend, Vector128<double> left, Vector128<double> right, byte rightIndex)
Vector128<float> FusedMultiplyAddBySelectedScalar(Vector128<float> addend, Vector128<float> left, Vector64<float> right, byte rightIndex)
Vector128<float> FusedMultiplyAddBySelectedScalar(Vector128<float> addend, Vector128<float> left, Vector128<float> right, byte rightIndex)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplyAddBySelectedScalarTest(System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],ubyte):System.Runtime.Intrinsics.Vector64`1[Single]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;* V03 arg3         [V03    ] (  0,  0   )   ubyte  ->  zero-ref   
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fmla    v0.2s, v1.2s, v2.s[0]
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

38. FusedMultiplyAddNegatedScalar

Vector64<double> FusedMultiplyAddNegatedScalar(Vector64<double> addend, Vector64<double> left, Vector64<double> right)

This method multiplies the values of the left and right vector, negates the product, subtracts the value of theaddend vector from the product, and returns the result.

private Vector64<double> FusedMultiplyAddNegatedScalarTest(Vector64<double> addend, Vector64<double> left, Vector64<double> right)
{
  return AdvSimd.FusedMultiplyAddNegatedScalar(addend, left, right);
}
// addend = <100.5>
// left = <5.5>
// right = <15.5>
// Result = <-185.75>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<float> FusedMultiplyAddNegatedScalar(Vector64<float> addend, Vector64<float> left, Vector64<float> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplyAddNegatedScalarTest(System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fnmadd  d16, d1, d2, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

39. FusedMultiplyAddScalar

Vector64<double> FusedMultiplyAddScalar(Vector64<double> addend, Vector64<double> left, Vector64<double> right)

This method multiplies corresponding floating-point values in the vectors in the left and right vectors, adds the product to the vector elements of the addened vector, and returns the accumulated result vector.

private Vector64<double> FusedMultiplyAddScalarTest(Vector64<double> addend, Vector64<double> left, Vector64<double> right)
{
  return AdvSimd.FusedMultiplyAddScalar(addend, left, right);
}
// addend = <100.5>
// left = <5.5>
// right = <15.5>
// Result = <185.75>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<float> FusedMultiplyAddScalar(Vector64<float> addend, Vector64<float> left, Vector64<float> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplyAddScalarTest(System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fmadd   d16, d1, d2, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

40. FusedMultiplyAddScalarBySelectedScalar

Vector64<double> FusedMultiplyAddScalarBySelectedScalar(Vector64<double> addend, Vector64<double> left, Vector128<double> right, byte rightIndex)

This method multiplies the vector elements in the left vector by an element at rightIndex of the right vector, and accumulates the product to the corresponding vector elements of the addend vector and returns the result vector.

private Vector64<double> FusedMultiplyAddScalarBySelectedScalarTest(Vector64<double> addend, Vector64<double> left, Vector128<double> right, byte rightIndex)
{
  return AdvSimd.Arm64.FusedMultiplyAddScalarBySelectedScalar(addend, left, right, 0);
}
// addend = <11.5>
// left = <11.5>
// right = <11.5, 12.5>
// rightIndex = 0
// Result = <143.75>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector64<float> FusedMultiplyAddScalarBySelectedScalar(Vector64<float> addend, Vector64<float> left, Vector64<float> right, byte rightIndex)
Vector64<float> FusedMultiplyAddScalarBySelectedScalar(Vector64<float> addend, Vector64<float> left, Vector128<float> right, byte rightIndex)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplyAddScalarBySelectedScalarTest(System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],ubyte):System.Runtime.Intrinsics.Vector64`1[Double]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )  simd16  ->   d2         HFA(simd16) 
;* V03 arg3         [V03    ] (  0,  0   )   ubyte  ->  zero-ref   
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fmla    d0, d1, v2.d[0]
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

41. FusedMultiplySubtract

Vector64<float> FusedMultiplySubtract(Vector64<float> minuend, Vector64<float> left, Vector64<float> right)

This method multiplies corresponding floating-point values in the vectors in the left and right vectors, negates the product, adds the product to the corresponding vector element of minuend vector, and returns the result.

private Vector64<float> FusedMultiplySubtractTest(Vector64<float> minuend, Vector64<float> left, Vector64<float> right)
{
  return AdvSimd.FusedMultiplySubtract(minuend, left, right);
}
// minuend = <11.5, 12.5>
// left = <21.5, 22.5>
// right = <11.5, 12.5>
// Result = <-235.75, -268.75>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector128<float> FusedMultiplySubtract(Vector128<float> minuend, Vector128<float> left, Vector128<float> right)

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector128<double> FusedMultiplySubtract(Vector128<double> minuend, Vector128<double> left, Vector128<double> right)

See Microsoft docs here and here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplySubtractTest(System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fmls    v0.2s, v1.2s, v2.2s
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

42. FusedMultiplySubtractByScalar

Vector64<float> FusedMultiplySubtractByScalar(Vector64<float> minuend, Vector64<float> left, Vector64<float> right)

This method multiplies floating-point value element at 0th index of right vector with elements in the left vector, negates the product, adds the product to the vector elements of the minuend vector, and returns the accumulated result vector.

private Vector64<float> FusedMultiplySubtractByScalarTest(Vector64<float> minuend, Vector64<float> left, Vector64<float> right)
{
  return AdvSimd.Arm64.FusedMultiplySubtractByScalar(minuend, left, right);
}
// minuend = <11.5, 12.5>
// left = <21.5, 22.5>
// right = <11.5, 12.5>
// Result = <-235.75, -246.25>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector128<double> FusedMultiplySubtractByScalar(Vector128<double> minuend, Vector128<double> left, Vector64<double> right)
Vector128<float> FusedMultiplySubtractByScalar(Vector128<float> minuend, Vector128<float> left, Vector64<float> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplySubtractByScalarTest(System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fmls    v0.2s, v1.2s, v2.s[0]
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

43. FusedMultiplySubtractBySelectedScalar

Vector64<float> FusedMultiplySubtractBySelectedScalar(Vector64<float> minuend, Vector64<float> left, Vector64<float> right, byte rightIndex)

This method multiplies floating-point value element at rightIndex index of right vector with elements in the left vector, negates the product, adds the product to the vector elements of the minuend vector, and returns the accumulated result vector.

private Vector64<float> FusedMultiplySubtractBySelectedScalarTest(Vector64<float> minuend, Vector64<float> left, Vector64<float> right, byte rightIndex)
{
  return AdvSimd.Arm64.FusedMultiplySubtractBySelectedScalar(minuend, left, right, 0);
}
// minuend = <11.5, 12.5>
// left = <21.5, 22.5>
// right = <11.5, 12.5>
// rightIndex = 0
// Result = <-235.75, -246.25>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd.Arm64
Vector64<float> FusedMultiplySubtractBySelectedScalar(Vector64<float> minuend, Vector64<float> left, Vector128<float> right, byte rightIndex)
Vector128<double> FusedMultiplySubtractBySelectedScalar(Vector128<double> minuend, Vector128<double> left, Vector128<double> right, byte rightIndex)
Vector128<float> FusedMultiplySubtractBySelectedScalar(Vector128<float> minuend, Vector128<float> left, Vector64<float> right, byte rightIndex)
Vector128<float> FusedMultiplySubtractBySelectedScalar(Vector128<float> minuend, Vector128<float> left, Vector128<float> right, byte rightIndex)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplySubtractBySelectedScalarTest(System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],ubyte):System.Runtime.Intrinsics.Vector64`1[Single]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;* V03 arg3         [V03    ] (  0,  0   )   ubyte  ->  zero-ref   
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fmls    v0.2s, v1.2s, v2.s[0]
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 20, prolog size 8

44. FusedMultiplySubtractNegatedScalar

Vector64<double> FusedMultiplySubtractNegatedScalar(Vector64<double> minuend, Vector64<double> left, Vector64<double> right)

This method multiplies the values of the left and right vectors, subtracts the value of the minuend vector, and returns the result.

private Vector64<double> FusedMultiplySubtractNegatedScalarTest(Vector64<double> minuend, Vector64<double> left, Vector64<double> right)
{
  return AdvSimd.FusedMultiplySubtractNegatedScalar(minuend, left, right);
}
// minuend = <11.5>
// left = <11.5>
// right = <11>
// Result = <115>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<float> FusedMultiplySubtractNegatedScalar(Vector64<float> minuend, Vector64<float> left, Vector64<float> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplySubtractNegatedScalarTest(System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fnmsub  d16, d1, d2, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8

45. FusedMultiplySubtractScalar

Vector64<double> FusedMultiplySubtractScalar(Vector64<double> minuend, Vector64<double> left, Vector64<double> right)

This method multiplies the values of theleft and right vectors, negates the product, adds that to the value of the minuend vector, and returns the result.

private Vector64<double> FusedMultiplySubtractScalarTest(Vector64<double> minuend, Vector64<double> left, Vector64<double> right)
{
  return AdvSimd.FusedMultiplySubtractScalar(minuend, left, right);
}
// minuend = <11.5>
// left = <11.5>
// right = <11>
// Result = <-115>

Similar APIs that operate on different sizes:

// class System.Runtime.Intrinisics.AdvSimd
Vector64<float> FusedMultiplySubtractScalar(Vector64<float> minuend, Vector64<float> left, Vector64<float> right)

See Microsoft docs here, ARM docs here.

Assembly generated:

; Assembly listing for method AdvSimdMethods:FusedMultiplySubtractScalarTest(System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double],System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
;
;  V00 arg0         [V00,T00] (  3,  3   )   simd8  ->   d0         HFA(simd8) 
;  V01 arg1         [V01,T01] (  3,  3   )   simd8  ->   d1         HFA(simd8) 
;  V02 arg2         [V02,T02] (  3,  3   )   simd8  ->   d2         HFA(simd8) 
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
; Lcl frame size = 0
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
            fmsub   d16, d1, d2, d0
            mov     v0.8b, v16.8b
            ldp     fp, lr, [sp],#16
            ret     lr

; Total bytes of code 24, prolog size 8